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ABSTRACT 


One  of  the  major  components  in  a  hardware  implementation  of  the  dis¬ 
crete  Fourier  transform  (DFT)  is  the  multiplier  hardware.  There  are 
algorithms  available  which  reduce  the  number  of  multiplications  used  in 
computing  the  DFT,  and  there  are  hardware  techniques  for  implementing 
multipliers.  This  paper  presents  an  efficient  method  of  using  table 
lookup  multipliers  for  the  fast  Fourier  transform  ( F FT ) .  This  method 
implements  the  multipliers  with  a  small  number  of  modest  size  tables. 


Ac..  .  . 

'  ( 

K.ii.,  X 

H 

LT IV  ’  ;  j 

6ft  ffn! 

r- 

U 

•  r  *  •  / 
s  ...  — 

*  »  . 

• 

Dist 

ft 

A  ...11  ftl:d/or 
special 

I.  INTRODUCTION 


Filtering  and  spectral  analysis  are  frequently  required  tasks  In  sig¬ 
nal  processing.  In  digital  signal  processing,  these  tasks  are  often 
Implemented  using  the  Discrete  Fourier  Transform  (DFT),  or  more  specifi¬ 
cally  the  fast  Fourier  transform  (FFT)  algorithm. 

Radar  signal  processing  provides  several  applications  for  the  FFT. 
For  example,  the  FFT  can  be  used  in  conjunction  with  a  pulse  compressor 
to  provide  pulse-Doppler  processing  {1]  as  shown  in  Figure  1.  In 
pulse-Doppler  processing,  the  echos  of  the  transmitted  pulses  are  first 
compressed  by  a  matched  filter.  Then,  for  each  range  cell  or  time 
delay,  the  FFT  Is  applied  to  the  echo  of  many  transmitted  pulses.  The 
phase  progression  across  these  echos  shows  up  as  a  Doppler  shift  in  the 
FFT  output. 

Another  application  of  the  FFT  Is  the  implementation  of  digital  fil¬ 
ters.  It  Is  well  known  12]  that  the  convolution  of  data  with  a  filter 
can  be  Implemented  by  taking  the  FFT  of  both  the  data  and  the  filter, 
multiplying  the  two  results,  and  taking  an  1  ..verse  FFT.  Frequently, 
this  process  takes  less  computation  Chan  implementing  the  filter 
directly. 

t 

The  main  components  used  in  building  FFT  hardware  are  adders,  multi¬ 
pliers,  and  memory.  Traditionally,  the  multipliers  have  been  considered 
to  be  the  most  expensive  component  in  terms  of  speed,  amount  of  hardware 


required,  and  power  consumption.  It  is  not  surprising  then  that  much 
research  has  been  done  to  develop  transform  algorithms  which  reduce  the 
number  of  multiplications  (3, 4, 5, 6, 7]. 

This  paper  presents  several  techniques  which  may  be  combined  in  order 
to  implement  the  multiplications  used  in  the  DFT  as  a  table  lookup  pro¬ 
cess,  thus  trading  memory  for  multiplier  hardware.  Using  memories  for 
multipliers  in  the  FFT  has  been  suggested  before  [8],  although  the  tech¬ 
nique  is  quite  different  from  that  presented  here. 
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II.  COEFFICIENT  FACTORIZATION 

The  main  idea  of  the  technique  presented  in  this  paper  is  to  factor 
the  coefficients  used  in  the  FFT  multipliers  in  such  a  way  that  only  a 
few  distinct  multiplier  coefficients  are  used.  In  order  to  demonstrate 
this  factorization,  we  first  examine  multiplication  by  powers  of  roots 
of  unity,  then  apply  these  results  to  the  FFT. 

Let  W  be  an  N-th  root  of  unity.  It  is  clear  that  there  are  only  N 
different  integral  powers  of  W;  W0,...,  WN-1.  To  understand  the  factor¬ 
ization,  suppose  that  we  wish  to  multiply  a  data  clement  by  Wn  for  some 
n**0, . . .  ,N-1.  We  can  write  the  binary  decomposition  of  n  as 


n  “  b  ^  2  ^  ^  “F.  •  •  "F  b^2^ 


(1) 


where 


b^  e  {0,1} 


(2a) 


and 


4  >  log2N  >  Jt-1. 


(2b) 


Making  use  of  eq.  (I)  we  easily  have 

...(/> 

so  that  all  N  powers  of  W  can  be  formed  from  the 

2* 

W  ,  i-0,...,  .8,-1 


(3) 


basic  terms 


(4) 
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Since  A  varies  with  log2N,  there  is  a  significant  reduction  in  the  num¬ 
ber  of  distinct  multipliers  required  to  Implement  multiplication  by  pow¬ 
ers  of  W  as  compared  with  a  brute-force  approach.  This  factorization 
technique  Is  similar  to  a  well  know  technique  for  exponentiation  [9,  p. 
398]. 

The  multiplications  which  occur  In  the  FFT  are  multiplications  by 
integral  powers  of  an  N-th  root  of  unity.  Thus,  the  above  discussion 
applies  to  multiplications  within  the  FFT.  In  particular,  the  decompo¬ 
sition  in  eq.  (3)  suggests  a  particular  construction  for  the  multipliers 
in  a  FFT  hardware  realization,  as  shown  in  Figure  2.  The  multiplier  is 
constructed  as  having  A  stages,  and  the  data  to  be  multiplied  flows 

from  stage  to  stage  in  a  pipeline  fashion.  At  the  i-th  stage  from  the 

21 

right  (for  1=0,...,  A-l)  the  data  is  either  multiplied  by  W  (when 
bj-1)  or  bypasses  the  stage  (when  b^O). 

The  next  step  in  the  design  of  the  FFT  multiplier  i-  to  implement  the 
individual  stages.  The  stages  corresponding  to  \  -  -.-2,  -1  have  multi¬ 

pliers  of  — j  and  -1,  both  of  which  are  vorv  simple  and  require  only  a 
small  amount  of  hardware.  For  the  smaller  values  of  i,  this  implementa¬ 
tion  may  be  done  ns  shown  in  Figure  3.  The  multiplication  by  the  com- 
21 

plex  number  W  is  shown  as  four  real  multiplications  followed  by  two 
additions.  Another  arrangement  for  the  complex  multiplier  is  possible 
which  uses  three  adders  and  three  real  multipliers. 


Each  of  the  real  multiplications  in  Figure  3  is  multiplication  by  a 
constant  and  can  be  implemented  as  a  table  lookup  using  Head  Only  Memory 
(ROM).  For  10  bit  arithmetic,  each  ROM  would  require  10  24  words  or 


10,240  bits.  However,  large  wordlengths  pose  a  problem.  For  example, 
16  bit  arithmetic  requires  65,536  words  or  1M  bit,  which  is  quite  largel 
There  is,  however,  a  solution  to  this  problem,  bong  wordlengths  may  be 
divided  into  a  most  significant  half  (MSH)  and  a  least  significant  half 
(LSH).  Each  half  can  be  multiplied  separately,  and  the  results  combined 
with  an  adder.  Figure  4  shows  this  double  precision  technique  for  16 
bit  arithmetic,  using  only  6144  bits  of  ROM  plus  an  adder.  The  ROM  for 
the  LSH  is  smaller  than  that  for  the  MSH,  since  the  LSH  of  the  input 
only  affects  the  LSH  of  the  output. 

There  are  numerous  variations  on  the  multiplier  design  just  men¬ 
tioned.  For  example,  an  alternative  to  bypassing  the  multiplications, 
aa  shown  in  Figure  2,  is  to  multiply  the  data  by  1  when  b^-0.  This 
selection  of  the  multiplier  constant  can  be  accomplished  by  using  b^  as 
an  address  bit  for  the  tables,  so  that  effectively  a  different  set  of 
tables  is  selected  for  b^-0  and  b^-1. 

This  alternative  to  bypassing  nicely  lends  Itself  to  combining  sev¬ 
eral  of  the  multiplier  stages  in  Figure  2  into  a  single  stage  by  using 
the  bits  of  the  exponent  for  W  as  address  bits  for  the  tables.  The 
result  of  thl3  combination  is  to  Increase  the  table  sizes  while  decreas¬ 
ing  the  number  of  adders  and  bypasses.  Thus,  when  using  commercially 
available  components,  excess  amounts  of  "leftover"  memory  may  be  uti¬ 
lized  in  an  efficient  manner  which  reduces  hardware.  This  combining  of 
stages  also  helps  reduce  computational  noise  as  mentioned  In  the  next 


section. 


Application  of  the  above  multiplier  design  to  FFT  pipelines  yields 
additional  advantages*  Consider  a  radix  2  FFT  pipeline  for  decimation 
in  frequency  [10].  At  the  first  multiplier,  all  bits  of  the  exponent 
for  W  are  required.  However,  at  the  second  multiplier  the  least  signi¬ 
ficant  bit  is  always  zero.  At  the  third  multiplier,  the  least  signifi¬ 
cant  two  bits  are  always  zero,  and  so  on  down  the  pipeline.  Thus,  a3  we 
go  down  the  pipeline,  each  multiplier  requires  one  fewer  stage  than  the 
previous  multiplier,  allowing  a  reduction  in  hardware.  A  radix  4  FFT 
pipeline  is  similar,  dropping  two  stages  in  each  successive  multiplier. 


III.  COMPUTATION  NOISE 


A  disadvantage  of  the  table  lookup  multiplier  scheme  described  in  the 

previous  section  is  that  each  stage  within  the  multiplier  (except  the 

stages  for  multiplication  by  -j  and  -1)  introduces  some  computational 
noise.  Thus,  the  computational  noise  is  worse  for  this  method  than  if  a 

standard  multiplier  were  used.  It  can  be  shown,  however,  that  the  addi¬ 
tional  noise  is  small. 

As  an  example,  consider  a  radix  2  pipeline  using  L  bit  integer  arith¬ 
metic.  It  will  be  assumed  that  at  the  outputs  of  each  butterfly,  the 
data  is  divided  by  2  to  avoid  overflows.  Following  an  analysis  similar 
to  that  in  [2],  it  can  be  shown  that  for  an  FFT  implementation  using  a 
standard  multiplier  the  signal-to-noise  ratio  due  to  computational  noise 
is  given  by 


SNRg=s 


_1 

7N 


(5) 


A  similar  result  can  be  derived  for  the  table  lookup  implementation. 
If  there  are  M  guard  bits  internal  to  each  stage  of  the  table  lookup 
multipliers,  then  for  the  table  lookup  scheme  it  can  be  shown  that  the 
signal-to-noise  ratio  due  to  computational  noise  is  approximately 


SNR 

T 


1+3-2 


-2M 


(6) 
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where  we  have  made  use  of  the  decreasing  number  of  stages  within  the 
pipeline  multipliers.  For  the  case  of  2  guard  bits  (M=2)  the  difference 
between  eqs.  (S)  and  (6)  corresponds  to  3.09  db. 


Variations  on  this  analysis  are,  of  course,  possible.  For  example, 
when  several  multiplier  stages  are  combined  into  a  single  stage  as  men¬ 
tioned  in  the  previous  section,  there  are  fewer  stages  to  introduce  com¬ 
putational  noise  and  SNR^  is  thus  closer  in  value  to  SNRg.  Similar 
results  are  obtained  for  radix  4  implementations  except  that  SNR  and 
SNRt  are  greater  than  the  radix  2  cases,  due  to  a  smaller  number  of  but¬ 
terflies  . 

In  order  to  understand  the  effect  of  computational  noise,  it  is 
necessary  to  know  the  ideal  signal-to-noise  ratio  which  would  be 
obtained  if  infinite  precision  arithmetic  were  used.  In  many  applica¬ 
tions,  the  ideal  SNR  can  be  thought  of  as  the  SNR  of  the  input  data  plus 
the  processing  gain.  When  the  ideal  SNR  is  significantly  less  than  the 
SNR  due  to  computational  noise,  the  computational  noise  does  not  signi¬ 
ficantly  affect  the  result. 

Taking  into  consideration  the  ideal  SNR,  we  show  in  Figure  5  a  plot 
of  the  output  SNR  of  the  standard  and  table  lookup  multiplier  implemen- 

I 

tations  for  a  512  point  FFT  using  16  bit  arithmetic,  with  2  guard  bits 
(M"2)  in  the  table  lookup  case.  It  is  clear  that,  in  this  case,  for  an 
ideal  SNR  of  50  db  or  less  the  computational  noise  of  either  implementa¬ 
tion  is  insignificant.  For  many  practical  applications,  the  ideal  SNR 
is  less  than  50  db.  Between  50  db  and  60  db  the  difference  in  the  t'O 
SNR's  becomes  apparent.  Similar  curves  are  obtained  for  other  transform 
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lengths  and  wordlengths.  For  shorter  transform  lengths  the  breakpoint 
between  the  two  curves  occurs  at  a  higher  db  level,  and  for  shorter 
wordlengths  the  breakpoint  occurs  at  a  lower  db  level. 
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IV.  SIZING 


A  preliminary  sizing  study  lias  been  done  in  order  to  compare  the 
hardware  required  for  the  table  lookup  scheme  wiLh  the  hardware  required 
in  a  standard  multiplier  implementation.  In  both  cases,  a  pipelined 
radix  2  FFT  is  designed  with  off-the-shelf  components.  Ihe  amounts  of 
hardware  used  by  these  designs  is  shown  in  Table  1  for  transform  lengths 
of  32  to  128  and  vovdlengths  of  12  and  16  bits.  In  all  cases,  the  table 
lookup  approach  uses  slightly  more  hardware  than  the  standard  multiplier 
approach,  but  typically  has  a  higher  data  rate. 

In  order  to  compare  the  various  designs,  we  have  defined  a 
f igure-of-merit  as 

Transform  Length  x  Data  Rate  (MHz) 

Merit  - — — — ' - _____  (7) 

IC  Count 

where  the  IC  Count  is  the  equivalent  number  of  1  (>  pin  integrated  cir¬ 
cuits  used  in  the  implementation.  This  definition  of  merit  takes  into 
account  the  ability  of  a  design  to  process  more  data  by  using  a  longer 
transform,  and  to  process  more  data  in  a  given  amount,  of  time  by  using  a 
higher  data  rate.  This  definition  also  takes  into  account  the  amount  of 
hardware  required  so  that  a  design  using  more  hardware  is  given  a  lower 
merit.  Wordlength  is  not  includ.d  in  the  definition  in  eq.  ( 7 1  since 
little  is  gained  *  r  excessively  long  word i eng t hs . 

Figure  6  show.-  tin-  t  igure-of-merit  tor  the  two  implementations  as  a 
function  of  t  ran.;!  •  •  an  length.  The  two  method-  are  seen  to  he  verv  com¬ 
parable  In  their  performance. 


These  estimates  were  based  on  using  commercially  available  compo¬ 


nents.  However,  il  the  designs  were  to  use  slightly  customized  inte¬ 
grated  circuits  of  the  same  level  of  technology  as  those  used  in  the 
designs  discussed  above,  the  hardware  requirements  of  both  the  standard 
and  table  lookup  techniques  would  be  reduced.  In  particular,  the  table 
lookup  method  could  benefit  more  than  the  standard  method.  This  differ¬ 
ence  is  illustrated  by  noting  that,  for  example,  the  table  lookup  method 
uses  more  integrated  circuits  as  data  latches.  These  latches  could 
easily  be  incorporated  into  the  outputs  of  memories,  adders,  and  other 
IC's.  Table  2  shows  the  custom  hardware  requirements  for  the  table  loo¬ 
kup  and  standard  implementations.  In  most  cases  the  table  lookup  method 
requires  less  hardware  than  the  standard  method,  which  when  combined 
with  the  higher  < it  a  rate  of  the  tables  gives  the  table  lookup  method  a 
distinct  advanin •>.•.  It  Is  also  believed  that  advances  in  memory  tech¬ 
nology  and  very  large  scale  integration  will  further  shift  the  balance 
In  favor  of  the  tab^e  lookup  method. 


V.  CONCLUSIONS 


In  this  papt'r,  w t  have  presented  an  efficient  means  by  which  table 
lookup  multipliers  may  be  used  in  the  FFT.  Preliminary  sizing  estimates 
indicate  that  by  using  off-the-shelf  components,  the  table  lookup  imple¬ 
mentation  is  comparable  in  performance  to  other  FFT  multiplier  techni¬ 
ques.  It  is  expected,  however,  that  advances  in  integrated  circuit 
technology  will  shift  the  balance  towards  smaller,  faster,  cheaper  memo¬ 
ries,  making  the  table  lookup  multipliers  very  attractive.  It  is  also 
expected  that  Very  Large  Scale  Integration  (VLSI)  will  allow  the  combi¬ 
nation  of  many  small  components  of  the  tabic  multipliers  into  a  single 
integrated  circuit,  making  construction  of  table  lookup  based  FFT  struc¬ 
tures  an  easier  task  than  using  existing  components. 
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Figure  1.  An  application  of  the  FFT  to  radar. 
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Figure  3.  Single  stage  of  the  multiplier. 
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12-BIT 
WORDS 
( 18  MHz) 

16-BIT 
WORDS 
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16 
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.'76 
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32 
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